Acta Psychiatrica Scandinavica
Wiley
All preprints, ranked by how well they match Acta Psychiatrica Scandinavica's content profile, based on 10 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Fabian Eitel; Sebastian Stober; Lea Waller; Lena Dorfschmidt; Henrik Walter; Kerstin Ritter
The authors have withdrawn this manuscript because the results were posted in error. The authors do not wish this work to be cited as a reference for the project. Please contact the corresponding author if you have any questions.
Atwood, B.; Holderness, E.; Verhagen, M.; Shinn, A. K.; Cawkwell, P.; Cerruti, H.; Pustejovsky, J.; Hall, M.-H.
Psychiatric electronic health records present unique challenges for machine learning due to their unstructured, complex, and variable nature. This study aimed to identify a cohort of patients with psychotic disorders and posttraumatic stress disorder (PTSD), develop clinically informed guidelines for annotating traumatic events in their health records, create a gold-standard, publicly available dataset, and demonstrate the dataset's suitability for training machine learning models to detect indicators of symptoms, substance use, and trauma in new records. We compiled a representative corpus of 200 narrative-heavy health records (470,489 tokens) from a centralized database and developed a detailed annotation scheme with a team of clinical experts and computational linguists. Clinicians annotated the corpus for trauma-related events and relevant clinical information with high inter-annotator agreement (0.715 for entity/span tags and 0.874 for attributes). Additionally, machine learning models were developed to demonstrate the practical viability of the gold-standard corpus for machine learning applications, achieving micro-F1 scores of 0.76 for spans and 0.82 for attributes, indicative of their predictive reliability. This study established the first gold-standard dataset for the complex task of labelling traumatic features in psychiatric health records. High inter-annotator agreement and model performance illustrate its utility in advancing the application of machine learning in psychiatric healthcare to better understand disease heterogeneity and treatment implications.
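The span-level micro-F1 reported above pools true positives, false positives, and false negatives over all annotated spans before computing precision and recall. A minimal sketch (the span tuple format and toy data are illustrative assumptions, not the paper's actual annotation format):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over exact-match spans.

    gold, pred: sets of (doc_id, start, end, label) tuples. Counts are
    pooled across all documents before precision/recall are computed.
    """
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(1, 0, 5, "TRAUMA"), (1, 10, 15, "SUBSTANCE"), (2, 3, 8, "SYMPTOM")}
pred = {(1, 0, 5, "TRAUMA"), (1, 10, 15, "SYMPTOM"), (2, 3, 8, "SYMPTOM")}
print(round(micro_f1(gold, pred), 3))  # 2 of 3 predictions match exactly -> 0.667
```

Exact-match scoring is the strictest convention; partial-overlap variants are also common for span tasks.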
Mineur, L.; Heide, M.; Eickhoff, S.; Avram, M.; Franzen, L.; Buschmann, F.; Schroepfer, F.; Rogg, H. V.; Andreou, C.; Bruegge, N.; Handels, H.; Borgwardt, S.; Korda, A.
Mental health research increasingly focuses on the relationship between psychiatric symptoms and observable manifestations of the face and body [1]. In recent studies [2,3], psychiatric patients have shown distinct patterns in movement, posture and facial expressions, suggesting these elements could enhance clinical diagnostics. The analysis of facial expressions is grounded in the Facial Action Coding System (FACS) [4], which provides a systematic method for categorizing facial expressions based on specific muscle movements, enabling detailed analysis of emotional and communicative behaviors. Combined with recent advances in Artificial Intelligence (AI), this method has shown promising results for detecting a patient's mental state. We analyze video data from patients with various psychiatric symptoms, using open-source Python toolboxes for facial expression and body movement analysis. These toolboxes facilitate face detection, facial landmark detection, emotion detection and motion recognition. Specifically, we aim to explore the connection between these physical expressions and established diagnostic tools, such as symptom severity scores, and ultimately enhance psychiatric diagnostics by integrating AI-driven analysis of video data. By providing a more objective and detailed understanding of psychiatric symptoms, this study could lead to earlier detection and more personalized treatment approaches, ultimately improving patient outcomes. The findings will contribute to the development of innovative diagnostic tools that are both efficient and accurate, addressing a critical need in mental health care.
Korda, A.
Suicide attempts are one of the most challenging psychiatric outcomes and have great importance in clinical practice. However, they remain difficult to detect in a standardised way to assist prevention because assessment is mostly qualitative and often subjective. As digital documentation is increasingly used in the medical field, Electronic Health Records (EHRs) have become a source of information that can be used for prevention purposes, containing codified data, structured data, and unstructured free text. This study aims to provide a quantitative approach to suicidality detection using EHRs, employing natural language processing techniques in combination with deep learning artificial intelligence methods to create an algorithm intended for use with medical documentation in German. Using psychiatric medical files from in-patient psychiatric hospitalisations between 2013 and 2021, free-text reports will be transformed into structured embeddings using a German-trained adaptation of Word2Vec, followed by a Long Short-Term Memory (LSTM) - Convolutional Neural Network (CNN) approach on sentences of interest. Text outside the sentences of interest will be analysed as context using a fixed-size ordinally-forgetting encoding (FOFE) before combining these findings with the LSTM-CNN results in order to label suicide-related content. This study offers a promising approach to the automated early detection of suicide attempts and therefore holds opportunities for mental health care.
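The fixed-size ordinally-forgetting encoding mentioned above compresses a token sequence of any length into a single fixed-length vector via the recursion z_t = α·z_{t-1} + e_t, where e_t is the one-hot vector of the t-th token and α in (0, 1) is the forgetting factor. A minimal sketch (the vocabulary size and α value are illustrative assumptions):

```python
def fofe(token_ids, vocab_size, alpha=0.7):
    """Fixed-size Ordinally-Forgetting Encoding (FOFE).

    Recursion: z_t = alpha * z_{t-1} + e_t, so earlier tokens are
    exponentially down-weighted while word order remains recoverable
    for alpha in (0, 1).
    """
    z = [0.0] * vocab_size
    for tok in token_ids:
        z = [alpha * v for v in z]  # decay older context
        z[tok] += 1.0               # add the current one-hot token
    return z

# Token 0 appears first and last; its two occurrences accumulate with
# different weights, unlike in an order-blind bag-of-words.
print(fofe([0, 1, 0], vocab_size=3, alpha=0.5))  # [1.25, 0.5, 0.0]
```

In the study's setup, a vector like this would summarize the context outside the sentences of interest before being combined with the LSTM-CNN features.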
Ehrig, L.; Wagner, A.-C.; Wolter, H.; Correll, C. U.; Geisel, O.; Konigorski, S.
Fetal alcohol-spectrum disorder (FASD) is underdiagnosed and often misdiagnosed as attention-deficit/hyperactivity disorder (ADHD). Here, we developed a screening tool for FASD in youth with ADHD symptoms. To develop the prediction model, medical record data from a German university outpatient unit were assessed, including 275 patients with FASD (with or without ADHD) and 170 patients with ADHD without FASD, all aged 0-19 years. We trained six machine learning models based on 13 selected variables and evaluated their performance. Random forests yielded the best prediction models, with a cross-validated AUC of 0.92 (95% confidence interval [0.84, 0.99]). Follow-up analyses indicated that a random forest model with six variables - body length and head circumference at birth, IQ, socially intrusive behaviour, poor memory and sleep disturbance - yielded equivalent predictive accuracy. We implemented the prediction model in a web-based app called FASDetect - a user-friendly, clinically scalable FASD risk calculator that is freely available at https://fasdetect.dhc-lab.hpi.de.
Wagner, M.; Jagayat, J.; Kumar, A.; Shirazi, A.; Alavi, N.; Omrani, M.
Mental health care is in a state of crisis, with demand for mental health services significantly surpassing available care. As such, building scalable and objective measurement tools for mental health evaluation is of primary concern. Given its use in diagnostics and treatment, spoken language stands out as a potential measurement modality. Here, a model is built for mental health status evaluation using natural language processing. Specifically, a RoBERTa-based model is fine-tuned on text from psychotherapy sessions to predict mental health status, with prediction accuracy on par with clinical evaluations at 74%.
Adhikary, P. K.; Singh, S.; Singh, S.; Sharma, P.; Soni, P.; Choudhary, R.; Saxena, C.; Chauhan, P.; Gupta, S. K.; Deb, K. S.; Singh, S. M.; Chakraborty, T.
Psychotherapy note-making is crucial for effective patient care. However, traditional formats such as SOAP (Subjective, Objective, Assessment, and Plan) and BIRP (Behavior, Intervention, Response, and Plan) often fail to capture the nuanced complexities of therapeutic sessions, as they primarily focus on surface-level details and lack a comprehensive understanding of the patient's history, mental status, and therapeutic process. While recent advances in Artificial Intelligence (AI) and Large Language Models (LLMs) show promise in clinical documentation, their application in psychotherapy note summarisation remains unexplored. We present iCARE (identifiers, Chief Concerns and Clinical History, Assessment and Analysis, Risk and Crisis, Engagement and Next Steps), a comprehensive framework for AI-assisted psychotherapy documentation that addresses these limitations. iCARE comprises 17 clinically relevant aspects, developed collaboratively with mental health professionals and aligned with established guidelines. We further introduce PATH (Psychotherapy Aspects and Treatment History summary), a novel dataset of annotated therapy sessions. Through extensive benchmarking with 11 LLMs, including both open- and closed-source models, we evaluate their performance across different note-taking aspects using automatic and human evaluation metrics. Our results show that closed-source models like Gemini Pro and GPT4o-mini excel in various aspects, with Gemini Pro achieving superior human evaluation scores. Notably, all models struggle with temporal reasoning and complex therapeutic interpretations. The findings suggest that current LLMs can assist in basic documentation but require improvements in handling longitudinal therapeutic relationships and aspects that require deeper clinical understanding and interpretative reasoning. This work advances mental health care documentation while emphasising the need for continued clinical expertise in psychotherapy note summarisation.
Hua, Y.; Blackley, S. V.; Shinn, A. K.; Skinner, J. P.; Moran, L. V.; Zhou, L.
Early and accurate diagnosis is crucial for effective treatment and improved outcomes, yet identifying psychotic episodes presents significant challenges due to its complex nature and the varied presentation of symptoms among individuals. One of the primary difficulties lies in the underreporting and underdiagnosis of psychosis, compounded by the stigma surrounding mental health and individuals' often diminished insight into their condition. Existing efforts leveraging Electronic Health Records (EHRs) to retrospectively identify psychosis typically rely on structured data, such as medical codes and patient demographics, which frequently lack essential information. Addressing these challenges, our study leverages Natural Language Processing (NLP) algorithms to analyze psychiatric admission notes for the diagnosis of psychosis, providing a detailed evaluation of rule-based algorithms, machine learning models, and pre-trained language models. Additionally, the study investigates the effectiveness of employing keywords to streamline extensive note data before training and evaluating the models. Analyzing 4,617 initial psychiatric admission notes (1,196 cases of psychosis versus 3,433 controls) from 2005 to 2019, we discovered that the XGBoost classifier employing Term Frequency-Inverse Document Frequency (TF-IDF) features derived from notes pre-selected by expert-curated keywords attained the highest performance, with an F1 score of 0.8881 (AUROC [95% CI]: 0.9725 [0.9717, 0.9733]). BlueBERT demonstrated comparable efficacy, with an F1 score of 0.8841 (AUROC [95% CI]: 0.97 [0.9580, 0.9820]) on the same set of notes. Both models markedly outperformed traditional International Classification of Diseases (ICD) code-based detection methods from discharge summaries, which had an F1 score of 0.7608, an improvement of roughly 0.12.
Furthermore, our findings indicate that keyword pre-selection markedly enhances the performance of both machine learning and pre-trained language models. This study illustrates the potential of NLP techniques to improve psychosis detection within admission notes and aims to serve as a foundational reference for future research on applying NLP for psychosis identification in EHR notes.
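The preprocessing idea described here — filtering notes down to keyword-bearing sentences before extracting TF-IDF features — can be sketched in a few lines. This is a toy illustration under assumed keywords and a naive sentence splitter, not the study's expert-curated list or pipeline:

```python
import math
import re

def keyword_select(note, keywords):
    """Keep only sentences containing at least one curated keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", note)
    kw = [k.lower() for k in keywords]
    return [s for s in sentences if any(k in s.lower() for k in kw)]

def tfidf(docs):
    """Plain TF-IDF: term frequency times log(N / document frequency)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = {}
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    return [
        {t: toks.count(t) * math.log(n / df[t]) for t in set(toks)}
        for toks in tokenized
    ]

note = "Patient is calm. Reports hearing voices at night. Sleep is stable."
print(keyword_select(note, ["voices", "paranoia"]))
```

The filtered sentences, rather than the full note, would then be vectorized and fed to a classifier such as XGBoost.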
Ahadian, P.; Fragano, A.; Guan, T.; Guan, Q.; Shalhout, S. Z.
Background: Depression is a leading cause of global disability. Timely identification of patients at risk for clinical worsening remains a major challenge. Electronic health records (EHRs) facilitate large-scale, real-world analyses of disease trajectories. However, standardized symptom scale data such as the Patient Health Questionnaire-9 are often unavailable or recorded only as unstructured text. In this context, International Classification of Diseases (ICD10) diagnostic-code-based severity progression provides a pragmatic alternative for developing predictive tools to identify worsening depression. Objective: We aim to develop and evaluate machine-learning and deep-learning models for predicting ICD10-defined progression from mild to moderate/severe depression using EHR data curated by the MedStar Health Research Institute (MHRI). Methods: We conducted a multi-institutional retrospective cohort analysis using the MHRI EHR database, which integrates data from 10 hospitals and 300 outpatient sites across the mid-Atlantic. Adults (≥18 years) with an initial ICD10 diagnosis of mild depression between 2017 and 2023 were included (N=2131). Nonprogressors were defined as patients whose mild major depressive disorder remained mild for 24 months (N=270). Progressors were defined as patients who developed moderate or severe ICD10 depression within 24 months of the index diagnosis (N=533). Data were stratified and split into training (60%), validation (20%), and test (20%) subsets. A heterogeneous feature set spanning demographics, healthcare utilization, socioeconomic indices, diagnostic context, and laboratory measurements was available. Logistic regression used elastic net regularization with fivefold cross-validation, and random forest hyperparameters were tuned by grid search. XGBoost, CatBoost, and a deep neural network (DNN) were trained with standard learning rate, depth, class weighting, and early stopping.
A deterministic top-model selection framework applied prespecified thresholds of sensitivity of at least 0.70 and AUC of at least 0.70, and composite rankings integrated accuracy, sensitivity, specificity, and the overfitting gap. Results: The analytic cohort included 803 patients with complete two-year follow-up. Under the selection criteria, the DNN failed to meet the AUC threshold (0.671) and was excluded. Among the remaining models, XGBoost achieved the top composite score (accuracy = 0.72; AUC = 0.776; sensitivity = 0.77; specificity = 0.63; overfit gap = 0.112). Logistic regression ranked second (accuracy = 0.71; AUC = 0.797; sensitivity = 0.79; specificity = 0.61; overfit gap = 0.052), followed by CatBoost and random forest, the latter penalized for overfitting (gap = 0.278). The TinyLlama audit note, generated through a local Hugging Face pipeline, confirmed XGBoost as the most balanced model. Conclusions: Using EHR data from a multi-institutional regional health system, we developed and validated machine-learning models that predicted progression of depression. XGBoost demonstrated the most reliable composite performance. These findings support the feasibility of leveraging socioeconomic and EHR data to predict worsening depression and emphasize the importance of transparent model-selection frameworks for trustworthy clinical artificial intelligence.
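A gate-then-rank selection framework of this kind can be sketched as follows. The composite score below (balanced metrics minus the overfitting gap) and the toy metric values are illustrative assumptions — the abstract does not give the exact composite formula:

```python
def select_models(results, min_sensitivity=0.70, min_auc=0.70):
    """Deterministic gate-then-rank model selection.

    results: dict name -> metrics dict. Models failing either
    prespecified threshold are excluded; survivors are ranked by an
    assumed composite score (higher is better).
    """
    def composite(m):
        return (m["accuracy"] + m["sensitivity"] + m["specificity"]
                - m["overfit_gap"])

    eligible = {
        name: m for name, m in results.items()
        if m["sensitivity"] >= min_sensitivity and m["auc"] >= min_auc
    }
    return sorted(eligible, key=lambda n: composite(eligible[n]), reverse=True)

toy = {
    "xgboost": {"accuracy": 0.75, "auc": 0.80, "sensitivity": 0.78,
                "specificity": 0.65, "overfit_gap": 0.05},
    "dnn":     {"accuracy": 0.70, "auc": 0.671, "sensitivity": 0.72,
                "specificity": 0.60, "overfit_gap": 0.10},
}
print(select_models(toy))  # the DNN fails the AUC gate and is dropped
```

Making the gates and the ranking function explicit like this is what keeps the selection deterministic and auditable.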
Verhees, F. G.; Huth, F.; Meyer, V.; Wolf, F.; Bauer, M.; Pfennig, A.; Ritter, P.; Kather, J. N.; Wiest, I. C.; Mikolas, P.
Background: Prompt engineering has the potential to enhance the ability of large language models (LLMs) to solve tasks through improved in-context learning. In clinical research, the use of LLMs has shown expert-level performance for a variety of tasks ranging from pathology slide classification to identifying suicidality. We introduce clickBrick, a modular prompt-engineering framework, and rigorously test its effectiveness. Methods: Here, we explore the effects of increasingly structuring prompts with the clickBrick framework for a comprehensive psychopathological assessment of 100 index patients from psychiatric electronic health records. We compare the performance of a locally run LLM (Llama-3.1-70B-Instruct) against an expert-labelled ground truth for a variety of successively built-up prompts for the extraction of 12 transdiagnostic psychopathological criteria. Potential clinical value was explored by training linear support vector machines on outputs from the strongest and weakest prompts to predict discharge ICD-10 main diagnoses for a historical sample of 1,692 patients. Outcomes: We could reliably extract information across 12 distinct psychopathological classification tasks from unstructured clinical text with balanced accuracies spanning 71% to 94%. Across tasks, we observed substantially improved extraction accuracy (between +19% and +36%) using clickBrick. The comparison unveiled great variation between prompts, with a reasoning prompt performing best in 7 out of 12 domains. Clinical value and internal validity were approximated by downstream classification of eventual psychiatric diagnoses for 1,692 patients. Here, clickBrick led to an improvement in overall classification accuracy from 71% to 76%. Interpretation: clickBrick prompt engineering, i.e. iterative, expert-led design and testing, is critical for unlocking LLMs' clinical potential.
The framework offers a reproducible pathway for deploying trustworthy generative AI across mental health and other clinical fields. Funding: The German Ministry of Research, Technology and Space and the German Research Foundation. Research in context: Evidence before this study: We searched PubMed/MEDLINE for articles without language restrictions published before June 25, 2025 that combined three concept blocks - "prompt engineering" or related synonyms, "large language model/LLM" or specific model names (e.g., ChatGPT, GPT-4, LLaMA), and psychiatric or mental-health terms (e.g., psychiatry, psychotherapy, depression, anxiety). Additionally, we asked ChatGPT o3 to design and execute a systematic review strategy to also capture relevant but not-yet peer-reviewed preprints, given only our manuscript title. After manual de-duplication and abstract screening, three of the 23 identified studies offered at least some information on their prompting strategies and were conducted on real-world clinical data: from psychotherapy transcripts (one study on multi-dimensional counselling therapy, not peer reviewed) or from online patient portal queries (two peer-reviewed studies on (a) empathy evaluation and (b) provider satisfaction and use of generated responses, with partial integration with electronic health records). None systematically structured their prompts in a transparent way or tested reasoning prompts. Beyond psychiatry, one study analyzing automated echocardiography reports did employ a comparison between two different prompts and an expert-led design strategy. A single study used structured and transparent prompt engineering to generate automated responses for simulated problem-solving therapy sessions. None of the highlighted studies reported both head-to-head comparisons of competing prompt strategies for full reproducibility and their application in real-world care, e.g. on electronic health records.
Collectively, the existing literature suggests growing interest but reveals a paucity of rigorous evidence on how prompt engineering impacts large language model performance in clinical psychiatry, particularly in real-world settings. Added value of this study: We demonstrate reliable information extraction from electronic health records across 12 distinct psychopathological classification tasks from unstructured clinical text, with substantially improved extraction accuracy (between +19% and +36%) using clickBrick, our prompt-engineering framework. The rationale for such an approach is supported by the surprising identification of zero-shot, few-shot and reasoning prompts as the best-performing prompts for different tasks, with a Chain-of-Thought reasoning prompt performing best in 7 out of 12 tasks. And while most studies rely on proprietary language models such as OpenAI's ChatGPT, our locally run version of a popular open-weight model (Llama-3.1-70B-Instruct) allows for privacy safeguarding of sensitive patient data, which is essential for ethical clinical application. Implications of all the available evidence: Generative artificial intelligence is poised to benefit psychiatric patients greatly, powering advances from therapy delivery to decision support and patient outreach. Rigorous prompt engineering with tools like clickBrick heightens their reliability and credibility, making clickBrick a cornerstone for bringing AI into everyday psychiatric care.
Mooney, M. A.; Neighbor, C.; Karalunas, S.; Dieckmann, N. F.; Nikolas, M.; Nousen, E.; Tipsord, J.; Song, X.; Nigg, J. T.
Proper diagnosis of ADHD is costly, requiring in-depth evaluation via interview, multi-informant and observational assessment, and scrutiny of possible other conditions. The increasing availability of data may allow the development of machine-learning algorithms capable of accurate diagnostic predictions using low-cost measures. We report on the performance of multiple classification methods used to predict a clinician-consensus ADHD diagnosis. Classification methods ranged from fairly simple (e.g., logistic regression) to more complex (e.g., random forest), and also included a multi-stage Bayesian approach. All methods were evaluated in two large (N>1000), independent cohorts. The multi-stage Bayesian classifier provides an intuitive approach that is consistent with clinical workflows and is able to predict ADHD diagnosis with high accuracy (>86%), though not significantly better than other commonly used classifiers, including logistic regression. Results suggest that data from parent and teacher surveys are sufficient for high-confidence classifications in the vast majority of cases using relatively straightforward methods.
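A multi-stage Bayesian classifier of this kind updates the probability of diagnosis after each informant's screen, carrying each posterior forward as the next stage's prior. A minimal sketch (the base rate, sensitivities, and specificities below are made-up illustration values, not estimates from the study):

```python
def bayes_update(prior, sensitivity, specificity, screen_positive):
    """One Bayesian update of P(diagnosis) from a binary screen result."""
    if screen_positive:
        num = sensitivity * prior
        den = num + (1.0 - specificity) * (1.0 - prior)
    else:
        num = (1.0 - sensitivity) * prior
        den = num + specificity * (1.0 - prior)
    return num / den

# Stage 1: parent survey positive; stage 2: teacher survey positive.
p = 0.20                                # assumed base rate
p = bayes_update(p, 0.90, 0.80, True)   # after the parent stage
p = bayes_update(p, 0.85, 0.75, True)   # after the teacher stage
print(round(p, 3))
```

Chaining stages like this mirrors a clinical workflow: each new informant either pushes the posterior toward a confident call or leaves the case for fuller evaluation.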
Varone, G.; Kumar, P.; Brown, J.; Boulila, W.
Psychiatric disorders are fundamentally challenged by symptom heterogeneity, high comorbidity, and the absence of objective biomarkers, which together result in substantial variability in clinical assessment and treatment selection. Patient-generated language captures rich information about subjective experience and symptom severity, which can be systematically encoded and analyzed using computational models, making it a scalable signal for psychiatric assessment. We compare two approaches: (i) a domain-specialized transformer fine-tuned on clinical language, based on the Bio-ClinicalBERT encoder architecture, and (ii) a large-scale instruction-tuned generalist encoder (Instructor-XL) used as a frozen feature extractor with a shallow classification head. A corpus of N = 151,228 de-identified texts was compiled from five public sources, covering four psychiatric phenotypes: anxiety, depression, schizophrenia, and suicidal intention. Models were evaluated using stratified 10-fold cross-validation with cost-sensitive training, prioritizing imbalance-aware metrics, including Macro-F1 and Matthews Correlation Coefficient (MCC), over accuracy. Bio-ClinicalBERT achieved superior overall performance (Macro-F1 = 0.78, MCC = 0.6752), indicating more reliable separation of diagnostically overlapping affective categories. In contrast, Instructor-XL achieved its highest class-specific performance for schizophrenia (F1 = 0.798). Explainability analyses suggest that the domain-specialized model places greater weight on clinically relevant terms, whereas the generalist model relies on a broader set of lexical features.
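The Matthews Correlation Coefficient prioritized above summarizes all four cells of the binary confusion matrix, which makes it more informative than raw accuracy under class imbalance. A minimal binary-case sketch with toy counts:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; by common convention, 0 when any
    marginal total is zero (the degenerate case).
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(tp=50, tn=40, fp=5, fn=10))  # strong but imperfect classifier
print(mcc(tp=5, tn=5, fp=0, fn=0))     # perfect prediction -> 1.0
```

Because every cell enters the formula, a classifier that ignores a minority class cannot score well, which is exactly why the study reports MCC alongside Macro-F1.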
Trivedi, S.; Simons, N. W.; Tyagi, A.; Ramaswamy, A.; Nadkarni, G. N.; Charney, A. W.
Background: Large language models (LLMs) are increasingly used in mental health contexts, yet their detection of suicidal ideation is inconsistent, raising patient safety concerns. Objective: To evaluate whether an independent safety monitoring system improves detection of suicide risk compared with native LLM safeguards. Methods: We conducted a cross-sectional evaluation using 224 paired suicide-related clinical vignettes presented in a single-turn format under two conditions (with and without structured clinical information). Native LLM safeguard responses were compared with an independent supervisory safety architecture with asynchronous monitoring. The primary outcome was detection of suicide risk requiring intervention. Results: The supervisory system detected suicide risk in 205 of 224 evaluations (91.5%) versus 41 of 224 (18.3%) for native LLM safeguards. Among 168 discordant evaluations, 166 favored the supervisory system and 2 favored the LLM (matched odds ratio ≈83.0). Both systems detected risk in 39 evaluations, and neither in 17. Detection was highest in scenarios with explicit suicidal ideation and lower in more ambiguous presentations. Conclusions: Native LLM safeguards frequently failed to detect suicide risk in this structured evaluation. An independent monitoring approach substantially improved detection, supporting the role of external safety systems in high-risk mental health applications of LLMs.
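The matched odds ratio quoted above follows directly from the discordant pairs, as in McNemar-style paired analyses: pairs where only one system detected risk divided by pairs where only the other did. A minimal sketch using the counts reported in the abstract:

```python
def matched_odds_ratio(discordant_a_only, discordant_b_only):
    """Conditional (matched-pairs) odds ratio: the ratio of the two
    discordant cells of a paired 2x2 table, as in McNemar-type tests.
    """
    if discordant_b_only == 0:
        raise ValueError("odds ratio undefined with zero discordant pairs")
    return discordant_a_only / discordant_b_only

# 166 pairs favored the supervisory system; 2 favored the native LLM.
print(matched_odds_ratio(166, 2))  # 83.0
```

Concordant pairs (both detect, or neither) do not enter the estimate, which is why the abstract reports them separately.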
Kolding, S.; Damgaard, J. G.; Bernstorff, M.; Hansen, L.; Ostergaard, S. D.; Danielsen, A. A.
Introduction: Use of coercive measures in psychiatric hospitals is clinically and ethically challenging. Aiming to support prevention, we developed and evaluated machine learning models to predict both mechanical restraint and a broader composite outcome that includes related coercive measures. Methods: The dataset comprised electronic health records (EHRs) from adults (≥18 years) who had at least one admission to the Psychiatric Services in the Central Denmark Region between 2015 and 2021. For each inpatient day, XGBoost machine learning models were trained to predict mechanical restraint or composite (mechanical, chemical, or manual) restraint within 48 hours. Hyperparameters were optimised for the area under the receiver operating characteristic curve (AUROC) using five-fold cross-validation on 85% of the data, with performance validated on a held-out 15% test set. Results: The cohort included 16,834 patients with 45,179 inpatient stays, covering 687,388 prediction days. Of these, 2,736 days were followed by a restraint episode within 48 hours, including 983 episodes of mechanical restraint. The final models were trained on 2,389 EHR-based predictors, derived from demographics, diagnoses, medications, and clinical notes. The mechanical restraint model achieved an AUROC of 0.921 (95% CI: 0.918-0.922) and a positive predictive value (PPV) of 4.9% when classifying the top 1% of risk scores as positive. The composite model achieved an AUROC of 0.912 (95% CI: 0.909-0.913) and a PPV of 4.2% when predicting mechanical restraint, and 0.900 (95% CI: 0.898-0.900) with a PPV of 10.4% when predicting composite restraint. Conclusion: The results indicate that incorporating related coercive measures into model training did not improve discrimination (AUROC) for predicting mechanical restraint but did increase PPV when predicting composite restraint, reflecting the higher outcome prevalence.
This suggests that leveraging related outcomes can inform prediction of rare events, emphasising the importance of problem framing in clinical prediction modelling. Future work should include external validation across temporal, geographic, and demographic contexts.
Significant Outcomes:
- A machine learning model trained solely for predicting mechanical restraint achieved strong performance (AUROC 0.92), identifying nearly one-third of restraint cases at high specificity.
- Training on a broader composite outcome yielded similar discriminatory performance when predicting mechanical restraint, while the higher base rate resulted in a higher positive predictive value for predicting composite restraint.
- Broadening the outcome to include multiple restraint types increased the number of at-risk patients detected due to the higher prevalence, without compromising accuracy for mechanical restraint, supporting shared underlying risk factors.
Limitations:
- The model requires more extensive external validation to assess generalisability across time, demographic groups, and settings, which may be limited by regional/national differences in legislation and clinical documentation.
- Prediction performance was highest near the restraint event, limiting early forecasting and suggesting that restricting predictions to the early phase of hospitalisation, where most restraint occurs, could elevate the base rate and improve model performance.
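A PPV "at the top 1% of risk scores" is computed by flagging only the highest-scoring prediction days as positive and measuring what fraction of those flags precede a real restraint episode. A minimal sketch with toy scores (the data are illustrative, not from the study):

```python
def ppv_at_top_fraction(scores, labels, fraction=0.01):
    """PPV when only the top `fraction` of risk scores are flagged positive.

    scores: predicted risk per prediction day; labels: 1 if a restraint
    episode actually followed within the horizon, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    flagged = ranked[:k]
    return sum(label for _, label in flagged) / k

scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10, 0.05, 0.02]
labels = [1,    0,    1,    0,    0,    0,    0,    0]
print(ppv_at_top_fraction(scores, labels, fraction=0.25))  # top 2 flagged -> 0.5
```

With a rarer outcome, the same ranking yields a lower PPV at a fixed alert budget — which is exactly the base-rate effect the conclusion describes for composite versus mechanical restraint.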
Mali, Y.; Zeng, Z.; Heo, K.; Zhang, G.; Chen, J.; Keramatian, K.; Saraf, G.; Solmi, M.; Tam, E.; Parikh, S.; Schaffer, A.; Beaulieu, S.; Ng, R.; Yatham, L. N.; Nunez, J.-J.
Objective: Clinical practice guidelines support evidence-based care but are often underused due to complexity, time constraints, and navigation challenges. We investigated whether a conversational agent (chatbot) using an open-weight large language model (LLM) with retrieval-augmented generation (RAG) could provide guideline-consistent answers for bipolar disorder management based on the full 2018 CANMAT and ISBD guidelines, compared against a system using only the base LLM. Method: We developed a multi-step RAG-based chatbot that retrieves relevant guideline sections and generates responses using Llama 3.3 70B. Twenty-one clinical vignettes spanning all guideline sections were created. Six expert psychiatrists generated queries and were presented with unlabelled paired responses from two systems: one using the base Llama 3.3 70B model, the other RAG-enhanced. Responses were rated for guideline consistency on a three-point scale and analyzed using mixed-effects ordinal logistic regression. Results: Experts evaluated 126 responses, of which 110 (87.3%) were rated as more correct than, or as correct as, those from the baseline system. The RAG system produced 80 answers (63.5%) rated fully consistent with the guidelines versus 24 (19.0%) for baseline, and only 10 answers with major deviations (7.9%) versus 48 (38.1%) for baseline. Ordinal regression showed RAG responses were significantly more likely to be more correct (OR = 9.1, 95% CI 5.3-16.3, p < 0.001), consistently across all raters. Preference ratings favored RAG answers in 78.7% of cases. Performance varied by vignette, with some errors in both retrieval and reasoning. Conclusion: The use of RAG with an open-weight model helped produce answers consistent with the CANMAT guidelines across vignettes that required adapting or combining guideline text, suggesting the viability of a bipolar guideline chatbot. We identified areas in which to improve results and evaluation.
Future work should explore additional retrieval strategies and LLMs, and test in more naturalistic settings.
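The retrieval step of a RAG pipeline like the one described above reduces to scoring guideline chunks against the query embedding and passing the top hits to the LLM as context. A minimal cosine-similarity sketch (the chunk texts and two-dimensional embeddings are toy assumptions; a real system would use a learned embedding model over the actual guideline sections):

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_embedding, chunks, top_k=2):
    """Return the text of the top_k chunks most similar to the query."""
    ranked = sorted(chunks,
                    key=lambda c: cosine(query_embedding, c["embedding"]),
                    reverse=True)
    return [c["text"] for c in ranked[:top_k]]

chunks = [
    {"text": "First-line maintenance options ...", "embedding": [0.9, 0.1]},
    {"text": "Managing acute mania ...",           "embedding": [0.1, 0.9]},
    {"text": "Psychoeducation ...",                "embedding": [0.5, 0.5]},
]
print(retrieve([1.0, 0.0], chunks, top_k=1))
```

The generation step then prepends the retrieved text to the prompt, which is what grounds the model's answer in the guideline rather than in its parametric memory.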
Morosoli, J. J.; Lind, P. A.; Spears, K.; Pratt, G.; Medland, S. E.
This study examined arrays offered by commercial pharmacogenomic (PGx) testing services for mental health care in Australia and the United States, with a focus on utility for non-European populations. Seven of the 14 testing services we identified provided the manifests of their arrays. We examined allele frequencies for each variant using data from the Allele Frequency Aggregator (ALFA) [1], genome Aggregation Database (gnomAD) [2], Exome Aggregation Consortium (ExAC) [2], and Japanese Multi Omics Reference Panel [3], and examined genetic heterogeneity. We also analyzed meta-data from the Pharmacogenomic Knowledge Base (PharmGKB) [4] and explored the biogeographical origin of supporting evidence for clinical annotations. Most arrays included the minimum allele set recommended by Bousman et al. [5]. However, few arrays included HLA-A or HLA-B. The most diverse allele frequencies were seen for variants in CYP3A5, ADRA2A and GNB3, with European and African populations showing the largest differences. Most evidence listed in PharmGKB originated from European or unknown ancestry samples.
Frydman-Gani, C.; Arias, A.; Perez Vallejo, M.; Londono Martinez, J. D.; Valencia-Echeverry, J.; Castano, M.; Bui, A. A. T.; Freimer, N. B.; Lopez-Jaramillo, C.; Olde Loohuis, L. M.
Show abstract
The accurate detection of clinical phenotypes from electronic health records (EHRs) is pivotal for advancing large-scale genetic and longitudinal studies in psychiatry. Free-text clinical notes are an essential source of symptom-level information, particularly in psychiatry. However, the automated extraction of symptoms from clinical text remains challenging. Here, we tested 11 open-source generative large language models (LLMs) for their ability to detect 109 psychiatric phenotypes from clinical text, using annotated EHR notes from a psychiatric clinic in Colombia. The LLMs were evaluated both "out-of-the-box" and after fine-tuning, and compared against a traditional natural language processing (tNLP) method developed from the same data. We show that while base LLM performance was poor to moderate (0.2-0.6 macro-F1 for zero-shot; 0.2-0.74 macro-F1 for few-shot), it improved significantly after fine-tuning (0.75-0.86 macro-F1), with several fine-tuned LLMs outperforming the tNLP method. In total, 100 phenotypes could be reliably detected (F1 > 0.8) using either a fine-tuned LLM or tNLP. To generate a fine-tuned LLM that can be shared with the scientific and medical community, we created a fully synthetic dataset free of patient information but based on the original annotations. We fine-tuned a top-performing LLM on this data, creating "Mistral-small-psych", an LLM that can detect psychiatric phenotypes from Spanish text with performance comparable to that of LLMs trained on real EHR data (macro-F1 = 0.79). Finally, the fine-tuned LLMs underwent external validation using data from a large psychiatric hospital in Colombia, the Hospital Mental de Antioquia, highlighting that most LLMs generalized well (0.02-0.16 point loss in macro-F1). Our study underscores the value of domain-specific adaptation of LLMs and introduces a new model for accurate psychiatric phenotyping in Spanish text, paving the way for global precision psychiatry.
Gountouna, V.-E.; Bermingham, M.; Kuznetsova, K.; Urda Munoz, D.; Agakov, F.; Robson, S.; Meijsen, J.; Campbell, A.; Hayward, C.; Wigmore, E.; Clarke, T.; Fernandez, A. M.; MacIntyre, D.; McKeigue, P. M.; Porteous, D.; Nicodemus, K.
Show abstract
Depression is a common psychiatric disorder with substantial recurrence risk. Accurate prediction from easily collected data would aid in diagnosis, treatment and prevention. We used machine learning in the Generation Scotland cohort to predict lifetime risk of depression and, among cases, recurrent depression. Rank aggregation was used to combine results across ten different algorithms and identify highly predictive variables. The model containing all but the cardiometabolic predictors had the highest predictive ability on independent data. Rank aggregation produced a reduced set of predictors without decreasing predictive performance (lifetime: 20 of 154 predictors, Receiver Operating Characteristic area under the curve (AUC) = 0.84; recurrent: 10 of 180 predictors, AUC = 0.76). Here we develop a pipeline which leads to a small set of highly predictive variables. This information can be easily collected with a smartphone application to help diagnosis and treatment, while longitudinal tracking may help patients in self-management. Significance: Depression is the most common psychiatric disorder and a leading cause of disability worldwide. Patients are often diagnosed and treated by non-specialist clinicians who have limited time available to assess them. We present a novel methodology which allowed us to identify a small set of highly predictive variables for a diagnosis of depression, or recurrent depression, in patients. This information can easily be collected using a tablet or smartphone application in the clinic to aid diagnosis.
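Rank aggregation across algorithms, the variable-selection step described above, can be illustrated with a Borda-style scheme: sum each predictor's rank position across the algorithms' importance orderings and sort by the total. This is one common aggregation rule, not necessarily the exact method the authors used, and the predictor names are invented for illustration:

```python
from collections import defaultdict

def aggregate_ranks(rankings):
    # Borda-style aggregation: sum each predictor's position (0 = best)
    # across algorithms, then order predictors by total; predictors that
    # rank highly under many algorithms float to the top.
    totals = defaultdict(int)
    for ranking in rankings:
        for pos, name in enumerate(ranking):
            totals[name] += pos
    return sorted(totals, key=totals.get)

# Toy importance orderings from three hypothetical algorithms, best first
rankings = [
    ["mood_score", "sleep", "smoking"],
    ["mood_score", "smoking", "sleep"],
    ["smoking", "mood_score", "sleep"],
]
consensus = aggregate_ranks(rankings)  # ['mood_score', 'smoking', 'sleep']
```

Truncating the consensus list gives the reduced predictor set; the paper's result is that a cut at 20 (lifetime) or 10 (recurrent) predictors preserved AUC.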
Richter, M.; Emden, D.; Leenings, R.; Winter, N. R.; Mikolajczyk, R.; Massag, J.; Zwiky, E.; Borgers, T.; Redlich, R.; Koutsouleris, N.; Falguera, R.; Edwin Thanarajah, S.; Padberg, F.; Reinhard, M. A.; Back, M. D.; Morina, N.; Buhlmann, U.; Kircher, T.; Dannlowski, U.; FOR2107 consortium, ; PRONIA consortium, ; MBB consortium, ; Hahn, T.; Opel, N.
Show abstract
Mental health research faces the challenge of developing machine learning models for clinical decision support. Concerns are rising about the generalizability of such models to real-world populations, due to sampling effects and disparities in available data sources. We examined whether harmonized, structured collection of clinical data and stringent measures against overfitting can facilitate the generalization of machine learning models for predicting depressive symptoms across diverse real-world inpatient and outpatient samples. Despite systematic differences between samples, a sparse machine learning model trained on clinical information exhibited strong generalization across diverse real-world samples. These findings highlight the crucial role of standardized routine data collection, grounded in unified ontologies, in the development of generalizable machine learning models in mental health. One-Sentence Summary: Generalization of sparse machine learning models trained on clinical data is possible for depressive symptom prediction.
Li, Z.; Wang, W.; Shahani, L. R.; Selek, S.; Vieira, R. M.; Soares, J. C.; Liu, H.; Huang, M.
Show abstract
Clinical phenotyping is the process of extracting patients' observable symptoms and traits to better understand their disease condition. Suicide phenotyping focuses more on behavioral and cognitive characteristics, such as suicidal ideation, attempts, and self-injury, to identify suicide risks and improve interventions. In this study, we leveraged the latest reasoning models, namely GPT-4o, o1, and o3-mini, to perform note-level multi-label classification and reasoning-generation tasks using previously annotated psychiatric evaluation notes from a safety-net psychiatric inpatient hospital in Harris County, Texas. Compared with a previously fine-tuned GPT-3.5 model, the out-of-the-box reasoning models prompted with in-context learning achieved comparable or better performance, with the highest accuracy of 0.94 and F1 of 0.90. We also implemented novel clinical-justification generation from these models on top of the traditional classification tasks. These findings mark a promising direction for clinical phenotyping that is interpretable and actionable using smaller, efficient reasoning models.
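For note-level multi-label classification like the task above, "accuracy" and "F1" are typically computed as subset accuracy (all labels on a note must match) and micro-F1 (pooled over label decisions). A minimal scoring sketch with invented example notes; this shows the standard metric definitions, not the paper's exact evaluation code:

```python
def multilabel_scores(gold, pred):
    # gold/pred: one set of suicide-phenotype labels per note.
    # Subset accuracy: the predicted label set must match exactly.
    exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    # Micro-F1: pool true/false positives and negatives over all notes.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    micro_f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return exact, micro_f1

# Toy annotations for three notes (labels are illustrative)
gold = [{"ideation"}, {"attempt", "self-injury"}, {"ideation", "attempt"}]
pred = [{"ideation"}, {"attempt"}, {"ideation", "attempt"}]
exact, micro = multilabel_scores(gold, pred)
```

Micro-F1 credits partially correct notes (the second note here), whereas subset accuracy does not, which is why the two headline numbers in such studies usually differ.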